Add support for UFloat in PintArray (#139) #140
base: master
Conversation
Signed-off-by: MichaelTiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
As noted, this change request may precipitate changes in how Pint deals with uncertainties (whose representation is not supported by vanilla Python parsers).
This lets you store ufloat objects, but I'm not sure how well pandas will cope with operations like reshaping, indexing, etc. I suspect some will work and some won't. You can add ufloats to the fixtures in test_pandas_extensiontests and it will run with both standard and ufloat quantities; that should show what works and what doesn't. I think you should make it so that a PintArray only stores floats or only ufloats. There are a few points where arrays are expected to contain the same type of object, and I think pandas expects each extension type to store only one type of object. I also think (for the time being) you should make it allow only floats or only ufloats globally, as there's no way to distinguish between a pint[m] (float) and a pint[m] (ufloat) type. This is a wider issue that also applies to ints and other numeric types.
I'm relatively new to all the pytest machinery in pint/pandas, etc. Am I understanding correctly that I should modify … to use ufloat-based values? I'm going to give that a try, but if there's a better way to do it (or you meant something different) please guide me. Thanks!
Progress... I've now got 203 failures, the vast majority of which are due to the fact that ufloats computed to the same value, but not the same variable, do not compare as equal. I do not know how to make progress on this.
Also: 191 passed, 1 skipped, 69 xfailed, 5 xpassed
Yes, like that. There are some fixtures that return a value from a list, like all_arithmetic_operators. I'd try changing these so they test both ufloat and float.
Use this fork of pandas for the assert...equal tests to work (see pint-pandas/.github/workflows/ci.yml, line 11, at 236cf0f).
Edit: I'm not 100% sure this will work; it makes the tests work for Quantity, but you might have a different issue there.
That's coming from a newer pandas with extra tests compared to pandas 1.4.x. You can rebase from my #133 if you're using a higher version, but for now I wouldn't worry about those errors.
More progress... down to essentially one kind of failure: the unhashability of UFloat. I don't have the Python sophistication to know what the next step would be, but I'll push the changes I have thus far.
Does it make sense to rebase PintArrays on uarrays (which can be tuned to use …)?
I'll give that a try. While I'm not sure it's totally necessary, I think it's a good exercise.
It will certainly keep the discussion with Pandas on point to present uniform ExtensionTypes, and to keep both Pint and Pandas on the straight-and-narrow there.
You can try; it's easy to check. I don't know what effects that'd have. Numpy is used across pandas, so it would still be used.
Big remaining problem is the unhashable UFloat type. Signed-off-by: MichaelTiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Signed-off-by: 72577720+MichaelTiemannOSC@users.noreply.github.com
@@ -24,6 +33,120 @@
ureg = PintType.ureg
import pandas._testing as tm
A note about these changes: when comparing UFloats, if the .s (std_dev) terms are zero, then two numbers with the same .n (nominal_value) terms will compare equal. However, if there are uncertainties in the numbers, even identical uncertainties, they will not compare equal unless they are identically the same uncertain variable. Since the test suite typically finds two ways to do the same thing and then compares the results, this leads to UFloats comparing not equal unless the uncertainties are zero.
I started making changes to try to compare UFloats by comparing their .n and .s separately (not using ==), but this is a real hassle, as there are many, many cases to test and unpack. For my own sanity, I simplified the task by using only UFloats with uncertainty values of zero, allowing me to test float vs. UFloat handling without the spurious problems of ufloat(a) =?= ufloat(b).
I've left the code I started writing for the more general relaxation of ufloat comparison, but it shouldn't actually be needed if we keep our ufloat uncertainty at zero (which is presently the case). So we can either remove, improve, or defer dealing with uassert_equal and friends. LMK
Signed-off-by: MichaelTiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Signed-off-by: MichaelTiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Big remaining problem is the unhashable UFloat type. Signed-off-by: MichaelTiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Signed-off-by: 72577720+MichaelTiemannOSC@users.noreply.github.com
Signed-off-by: MichaelTiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
…C/pint-pandas into ducks-unlimited Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
I thought it would be helpful to me to merge in changes that have been merged upstream, not realizing how that would affect the size of the PR. If I should revert to the previous version so that the PR has only my changes, I'll do that. I'm still learning my way around git, and everything new I do is truly a "learning experience".
Deal with the duality of np.nan and _ufloat_nans when constructing PintArrays, creating float64 magnitude arrays when possible and ufloat magnitudes when necessary. Note that PintArray columns don't advertise whether magnitudes are np.float64 or np.object; they are what they are. Also commented out a warning, which simply creates noise. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Fix conditional expression. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Don't try to build a large-enough-to-hold-uncertainties array if we are not using uncertainties in any way. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Allow pip install command to unpack uncertainties option into multiple arguments. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Thanks @hgrecco for creating Pint-0.23rc0. I've attempted to write some rules to optionally test uncertainties, but I don't know the right way to leave the normal Pint-Pandas matrix to do its thing and then create an uncertainties test that uses Pandas 2.1.0 and Pint 0.23rc0. The approach I took simply overrode Pandas and Pint every time. Maybe I need to add a null string "" and then test for it? My yaml/shell scripting is fairly primitive. In any case, if I can get the CI/CD working as it should, these changes should be ready to go for a pint_pandas-0.6rc0.
By using the `include` directive, we can add a special case to the matrix of possibilities. In this case we add the case where pandas>=2.1.0 and pint>=0.23rc0, which are both necessary for the UFloat array changes to work. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Don't test uncertainties in ci-pint-pre or ci-pint-master, as these don't have the necessary versions of Pandas or Pint to make the matrix include operation work. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Also test Pint-Pandas without uncertainties. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Trying again with different syntax to avoid duplicate key issue. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Trying yet another syntax to make `uncertainties` one of two options when pandas==2.1.0 and pint==0.23rc0. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Try using null instead of "" for uncertainties that we don't want to run. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Try using "null" to represent a condition where we don't want uncertainties installed. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Try using "no-thank-you" instead of "null" for matrix permutations not intended to use uncertainties. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Wrap `uncertainties` matrix parameter in `[]`. If we get this working, we can try setting the parameter to `''` from several commits ago. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Use single quotes when testing the value of `uncertainties`. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Tweak conditional matrix logic by using a "same-but-different" value for `pint` value assignment. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Untabify yaml file... Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Make uncertainties a default null value in the matrix. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Attempt to explicitly add python-version and numpy to `uncertainties` case. Otherwise, pytest never gets installed. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
We cannot reference the matrix when building the matrix, so unroll all parameter values. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Remove unused `uncertainties` from -pre and -master CI files. Only test one configuration of Python against uncertainties (trying to work around duplicate key problem). Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
We now have CI/CD testing without uncertainties across a matrix, and additionally testing uncertainties in one specific matrix case (Python 3.9, Pandas 2.1.0, Pint 0.23rc0). That should make us good to go for a 0.6rc0, no?
Nudge @andrewgsavage
Sorry, whenever I look at this I see that it uses a numpy object-type array, which I didn't expect to work... and I get hung up thinking about how it works. There aren't any examples using an array of objects, so although it works, it's not really supported. Merging this implies support, which I'm not really sure is a good idea. The code wasn't written with that in mind, and other than your issue test there's no testing for a mix of different objects in a numpy array.
Reading this through again, that's kind of what I meant by making a separate ExtensionArray for uncertainties, say UncertaintyArray. The PintArray can then use the UncertaintyArray for the data, as it does with other pandas EAs. The extra logic needed to handle uncertainties that you've written in this PR can be moved to the UA, and the PintArray can deal with the units and delegate the rest to the UA magnitude (like how pint's Quantity does). I'm not sure how well a UA would work with pint-pandas (it may still need some code to handle uncertainties), but I would like to see how well that works before deciding on this.
Me too 😆
Having spent a fair bit of time inside the Pandas code, especially transitioning from 1.5.x to 2.x, I can say they really have nicely generalized support for object types in NumPy arrays. I started off needing a few changes to Pandas to support this implementation, but over time every single change I originally needed was rendered unnecessary. I did manage to get some changes merged upstream (and have another one in the works; see pandas-dev/pandas#55825). That latter one is needed not because numpy doesn't tolerate object types, but because Pint Quantities have their own opinions about being put into arrays.
One of the very last things I needed to do was to ensure that the block managers would not mess up allocation of arrays that initially contained only floats but would later need to contain an uncertainty. I realized this was the same problem as starting with an array of float64 and then later changing one entry to a complex128. I went big and added comprehensive complex testing to Pandas (pandas-dev/pandas#54761), which did indeed expose the same problem. I proposed using the uncertainties changes to solve the complex128 problems, but the Pandas folks fixed that by always allocating a fresh array if something didn't fit, so those changes were no longer necessary.
Happy to continue the discussion, and happy to prototype. Just wanted to share my experiences so far.
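The float64-vs-complex128 analogy above can be demonstrated at the numpy level: a float64 buffer simply cannot hold the wider value, so a fresh, wider array has to be allocated (which is what pandas' block managers now do on reallocation).

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0], dtype=np.float64)
try:
    a[0] = 1 + 2j  # a float64 buffer cannot store a complex value
except TypeError:
    pass

# instead, a fresh, wider array must be allocated
b = a.astype(np.complex128)
b[0] = 1 + 2j
```

The same reasoning applies to a float64 magnitude array that suddenly needs to hold a UFloat: the fix is a new object-dtype allocation, not in-place mutation.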
There is this: does it do what you need on the uncertainty propagation side?
I reviewed that approach in this comment: hgrecco/pint#1797 (comment), which is why I re-upped my suggestion to merge the changes I developed instead. If I missed something, I would be happy to look again.
I made a start at this, enough to show what I'm thinking with code:
Main benefits to this approach are:
Do you have an example of this? I'm not familiar with uncertainties. I started off by inheriting from BaseMaskedArray, which uses a mask to know where the nans are. This may help with the nans issue, or it may not be needed. It's also not public... If avoiding the nans issue isn't needed, it may be better to resurrect pandas-dev/pandas#45544.
Who will be maintaining UncertaintyArray? Is this something we need to get the Pandas folks to adopt? Numpy? As for the wrapping/unwrapping, a better way to say this is that uncertainties does its magic not by specializing arrays but by wrapping all the NumPy ufuncs that operate on uncertainty values (though as NumPy and uncertainties have diverged, there are unintentional gaps; see lmfit/uncertainties#163 (comment)). unumpy arrays are not derived from numpy arrays; they wrap/reimplement everything, which is why we have the ugliness of unp.isnan(uarray) vs. pd.isna(PandasArray). It seems like UncertaintyArray could implement the necessities of uarrays while providing a pandas-friendly interface so we can use isna naturally. Which would be great. Let me know if you'd like to share a branch and collaborate. My code certainly exercises uncertainties, Pint, and Pint-Pandas!
It should either be in uncertainties or in another module, uncertainties-pandas.
Shouldn't, but may need changes like you found for pandas.
Can't remember exactly what the plans were, but I remember reading something, possibly about moving that to another module... but no one's got around to it in forever.
Yeah, I don't see that happening for a long time either. There are a few pandas-related issues under uncertainties, so I think that's where to make a PR. It may be better in another package, to prevent a dependency on pandas.
Numpy functions are a long way off! But I suspect you can do something similar to how pint does with …
xref: pandas-dev/pandas#14162, pandas-dev/pandas#20993, and pandas-dev/pandas#22290. The first two especially give strong encouragement to bring uncertainties into Pandas. I think that many of the 2.x changes make this much easier, though I have no clue yet how to deal with the …
None of those issues implies the UncertaintyArray should live in pandas! Just that one should be made. Indeed, in the first linked issue …
I'd be very surprised if they'd merge it into pandas, hence my uncertainties or uncertainties-pandas suggestion.
Signed-off-by: MichaelTiemann 72577720+MichaelTiemannOSC@users.noreply.github.com
`pre-commit run --all-files` with no errors